67 research outputs found

    Street-View Image Generation from a Bird's-Eye View Layout

    Bird's-Eye View (BEV) perception has received increasing attention in recent years, as it provides a concise and unified spatial representation across views and benefits a diverse set of downstream driving applications. While the focus has been placed on discriminative tasks such as BEV segmentation, the dual generative task of creating street-view images from a BEV layout has rarely been explored. The ability to generate realistic street-view images that align with a given HD map and traffic layout is critical for visualizing complex traffic scenarios and developing robust perception models for autonomous driving. In this paper, we propose BEVGen, a conditional generative model that synthesizes a set of realistic and spatially consistent surrounding images matching the BEV layout of a traffic scenario. BEVGen incorporates a novel cross-view transformation and spatial attention design that learns the relationship between camera and map views to ensure their consistency. Our model can accurately render road and lane lines, as well as generate traffic scenes under different weather conditions and times of day. The code will be made publicly available.
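
    The abstract gives no implementation details, but the cross-view spatial attention it describes can be pictured as standard cross-attention in which camera-view tokens query the BEV layout tokens. The sketch below is a minimal illustration under assumed shapes and names; it is not the authors' BEVGen code.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Illustrative sketch (not the authors' code): camera-view tokens
    attend to BEV layout tokens so that generated pixels stay consistent
    with the map. All shapes and names are assumptions."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cam_tokens, bev_tokens):
        # cam_tokens: (B, N_cam, dim) -- tokens of the street-view image(s)
        # bev_tokens: (B, N_bev, dim) -- tokens of the BEV layout
        out, _ = self.attn(query=cam_tokens, key=bev_tokens, value=bev_tokens)
        return out

# Toy usage: 6 surrounding cameras flattened into one token sequence.
cam = torch.randn(2, 6 * 64, 256)
bev = torch.randn(2, 32 * 32, 256)
print(CrossViewAttention()(cam, bev).shape)  # torch.Size([2, 384, 256])
```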

    An investigation into automatic people counting and person re-identification

    We study two video surveillance problems in this thesis: people counting and person re-identification. To address the problem of people counting, we first propose a method called Random Projection Forest to utilise rich hand-crafted features. To achieve computational efficiency and scalability, we use a random forest as the regression model, whose tree structure is intrinsically fast and scalable. Unlike traditional approaches to random forest construction, we embed random projection in the tree nodes to simultaneously combat the curse of dimensionality and introduce randomness into the tree construction, making our new method efficient and effective. We have also developed a deep learning model for people counting: a multi-task model that simultaneously predicts the number of people and the level of crowd density, which makes our method invariant to the image scale. To deal with the problem of insufficient training data, we propose an "ambiguous labelling" strategy to create various labels for the training images. In a series of experiments, we show that creating ambiguous labels is a simple but effective way to improve not only the deep learning model but also the Random Projection Forest model based on hand-crafted features.

    For the problem of person re-identification, we have developed a novel deep learning framework called Deep Augmented Attribute Network (DAAN) to learn augmented attribute features. We first manually label two large datasets with pre-defined mid-level semantic attributes. We then construct a deep neural network with two output branches: the first predicts the attributes of the input image, while the second generates complementary features that are fused with the output of the first branch to form the augmented attributes of the input image. We optimize the attribute branch with a multi-label classification loss and apply a Siamese network structure to ensure that the augmented attributes of images of the same person are close to each other whilst those of different persons are far apart. The learned augmented attribute features are then used for person re-identification based on Euclidean distance. As manually labelling images is a time-consuming process, we have also extended our method to datasets with only person ID information and no attribute labels. We have conducted comprehensive experiments and the results show that our method outperforms state-of-the-art methods.

    Since labelling identities and attributes for person images is time-consuming, we also propose an unsupervised method for person re-identification and apply it to the more challenging problem of partial person re-identification. We first use an established image segmentation method to generate superpixels and construct an Attributed Region Adjacency Graph (ARAG), in which nodes correspond to superpixels and edges represent correlations between superpixels. We then apply region-based Normalized Cut to the graph to merge similar neighbouring superpixels into natural image regions corresponding to various body parts and backgrounds. To extract features from the segmented patches, we apply a Denoising Autoencoder to learn a discriminative representation of the image patches in each node of the graph. Finally, the similarity of an image pair is measured by the Earth Mover's Distance (EMD) between the robust image signatures of the nodes in the corresponding ARAGs.
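
    The Siamese structure described for DAAN amounts to a contrastive objective over the augmented-attribute embeddings, trained jointly with the multi-label attribute loss. The following is a minimal sketch of such a joint loss; all names, dimensions and the margin value are assumptions, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def daan_style_loss(attr_logits, attr_labels, emb_a, emb_b, same_person, margin=1.0):
    """Illustrative sketch (not the thesis code): multi-label attribute
    classification plus a contrastive term that pulls augmented-attribute
    embeddings of the same person together and pushes those of different
    persons apart by at least `margin`. All names are assumptions."""
    # Multi-label attribute classification loss (attribute branch).
    attr_loss = F.binary_cross_entropy_with_logits(attr_logits, attr_labels)
    # Contrastive loss over the fused (augmented-attribute) embeddings.
    dist = F.pairwise_distance(emb_a, emb_b)
    contrastive = same_person * dist.pow(2) + \
        (1 - same_person) * F.relu(margin - dist).pow(2)
    return attr_loss + contrastive.mean()

# Toy usage with random tensors: 8 image pairs, 20 attributes, 128-d embeddings.
logits = torch.randn(8, 20); labels = torch.randint(0, 2, (8, 20)).float()
a, b = torch.randn(8, 128), torch.randn(8, 128)
same = torch.randint(0, 2, (8,)).float()
print(daan_style_loss(logits, labels, a, b, same))
```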

    Deep Reinforcement Learning based Patch Selection for Illuminant Estimation

    Previous deep learning based approaches to illuminant estimation either resized the raw image to a lower resolution or randomly cropped image patches for the deep learning model. However, such practices inevitably lead to information loss or to the selection of noisy patches that degrade estimation accuracy. In this paper, we regard patch selection in neural network based illuminant estimation as a control problem: choosing image patches so as to remove noisy patches and improve estimation accuracy. To achieve this, we construct a selection network (SeNet) to learn a patch selection policy. Based on data statistics and the learning progression of the deep illuminant estimation network (DeNet), the SeNet decides which training patches should be input to the DeNet, which in turn gives feedback to the SeNet so that it can update its selection policy. To achieve such interactive and intelligent learning, we utilize a reinforcement learning approach, policy gradient, to optimize the SeNet. We show that the proposed learning strategy enhances illuminant estimation accuracy, speeds up convergence and improves the stability of the DeNet training process. We evaluate our method on two public datasets and demonstrate that it outperforms state-of-the-art approaches.
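
    The policy gradient training described above can be sketched as a REINFORCE-style update: sample keep/drop actions from the SeNet's per-patch probabilities, observe a reward from the DeNet, and reinforce the sampled actions in proportion to that reward. Everything below (architecture, feature dimension, reward) is an illustrative assumption, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SeNetSketch(nn.Module):
    """Illustrative patch-selection policy (an assumption, not the paper's
    architecture): scores each candidate patch feature and emits a
    Bernoulli keep/drop probability."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, patch_feats):            # (num_patches, feat_dim)
        return torch.sigmoid(self.score(patch_feats)).squeeze(-1)

senet = SeNetSketch()
opt = torch.optim.Adam(senet.parameters(), lr=1e-3)

patch_feats = torch.randn(16, 64)              # stand-in patch statistics
probs = senet(patch_feats)
dist = torch.distributions.Bernoulli(probs)
actions = dist.sample()                        # 1 = feed this patch to DeNet

# Reward: e.g. the reduction in DeNet's estimation error after training on
# the selected patches (here a random stand-in value).
reward = torch.randn(())

# REINFORCE: raise the log-probability of the sampled actions in
# proportion to the observed reward.
loss = -(dist.log_prob(actions).sum() * reward)
opt.zero_grad(); loss.backward(); opt.step()
```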

    V2XP-ASG: Generating Adversarial Scenes for Vehicle-to-Everything Perception

    Recent advancements in Vehicle-to-Everything (V2X) communication technology have enabled autonomous vehicles to share sensory information to obtain better perception performance. With the rapid growth of autonomous vehicles and intelligent infrastructure, V2X perception systems will soon be deployed at scale, which raises a safety-critical question: how can we evaluate and improve their performance under challenging traffic scenarios before real-world deployment? Collecting diverse, large-scale real-world test scenes seems to be the most straightforward solution, but it is expensive and time-consuming, and such collections can only cover limited scenarios. To this end, we propose V2XP-ASG, the first open adversarial scene generator that can produce realistic, challenging scenes for modern LiDAR-based multi-agent perception systems. V2XP-ASG learns to construct an adversarial collaboration graph and simultaneously perturb multiple agents' poses in an adversarial yet plausible manner. Experiments demonstrate that V2XP-ASG can effectively identify challenging scenes for a wide range of V2X perception systems. Meanwhile, by training on a limited number of generated challenging scenes, the accuracy of V2X perception systems can be further improved by 12.3% on challenging scenes and 4% on normal scenes. Our code will be released at https://github.com/XHwind/V2XP-ASG.
    Comment: ICRA 2023, see https://github.com/XHwind/V2XP-ASG
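
    The paper combines adversarial collaboration-graph construction with pose perturbation; as a rough illustration of the pose-perturbation side only, the sketch below runs a black-box random search over bounded (x, y, yaw) offsets to find a harder scene. The function names, bounds and loss callable are assumptions, not the V2XP-ASG algorithm.

```python
import numpy as np

def adversarial_pose_search(agent_poses, perception_loss, n_iters=100,
                            max_shift=1.0, max_yaw=0.1, rng=None):
    """Illustrative sketch (not the V2XP-ASG algorithm): black-box random
    search that perturbs collaborating agents' (x, y, yaw) poses within
    plausible bounds to maximize the perception loss. `perception_loss`
    is an assumed callable that evaluates the V2X detector on the scene."""
    rng = rng or np.random.default_rng(0)
    best_poses = agent_poses.copy()
    best_loss = perception_loss(best_poses)
    for _ in range(n_iters):
        delta = rng.uniform(-1, 1, size=agent_poses.shape)
        delta *= np.array([max_shift, max_shift, max_yaw])  # bounded, plausible
        candidate = best_poses + delta
        loss = perception_loss(candidate)
        if loss > best_loss:                    # harder scene found; keep it
            best_poses, best_loss = candidate, loss
    return best_poses, best_loss

# Toy usage: 3 agents with (x, y, yaw); a dummy loss stands in for the detector.
poses = np.zeros((3, 3))
dummy_loss = lambda p: float(np.abs(p).sum())
print(adversarial_pose_search(poses, dummy_loss))
```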